Branch detection updates #667

JelmerBot · 2024-12-20T20:47:40Z

This PR resolves #660 by adding labels and probability parameters to BranchDetector.fit() that override the values from the input HDBSCAN object. Overridden clusters that form multiple connected components in the minimum spanning tree are detected, and the component labels are returned as the branch labels for those clusters. The condensed and linkage trees for those clusters are set to None, so scripts can detect what happened.
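To illustrate the connected-component check described above, here is a minimal sketch in plain Python. It is not the code from this PR: the helper name `component_labels`, the edge-list representation of the minimum spanning tree, and the example data are all assumptions made for illustration. The idea is to keep only the tree edges whose endpoints both belong to the cluster and then number the resulting components.

```python
def component_labels(mst_edges, labels, cluster):
    """Label the connected components of one cluster within an MST.

    Hypothetical helper, not part of hdbscan. `mst_edges` is a list of
    (point, point) index pairs and `labels` holds the (possibly
    overridden) cluster label of every point.
    """
    members = [i for i, lab in enumerate(labels) if lab == cluster]
    parent = {i: i for i in members}  # union-find forest over cluster members

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Only edges with both endpoints inside the cluster connect its members.
    for a, b in mst_edges:
        if a in parent and b in parent:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    # Number the components in order of first appearance.
    roots = {}
    return {i: roots.setdefault(find(i), len(roots)) for i in members}


# A path-shaped spanning tree over six points: 0-1-2-3-4-5.
mst_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
# Overridden labels: cluster 0 skips point 2, so it splits in two.
labels = [0, 0, 1, 0, 0, 1]
components = component_labels(mst_edges, labels, cluster=0)
n_components = len(set(components.values()))
```

In this toy example cluster 0 ends up with more than one component, which is the situation in which the component labels would stand in for branch labels.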

While working on this PR, I noticed that the branching code could be simplified considerably. This PR reverts some of the changes I made when I introduced the branch detection code. I also found and fixed some issues in the hierarchy simplification code that applies a persistence threshold, and added a persistence threshold parameter to the clustering code.

Finally, I made small changes in _hdbscan_boruvka.pyx to expose the computed core distances and neighbours. This allows me to use the implementation in another project.

review-notebook-app · 2024-12-20T20:47:45Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

JelmerBot · 2024-12-20T20:49:57Z

hdbscan/branch_data.py

+class BranchDetectionData(object):
+    """Input data for branch detection functionality.
+
+    Recreates and caches internal data structures from the clustering stage.
+


Moved BranchDetectionData to a new file so branches.py can import hdbscan_.py without cyclical imports.

JelmerBot · 2024-12-20T20:51:56Z

hdbscan/hdbscan_.py

-                self._finite_index = get_finite_row_indices(X)
-                clean_data = X[self._finite_index]
+                finite_index = get_finite_row_indices(X)
+                clean_data = X[finite_index]
                internal_to_raw = {
-                    x: y for x, y in zip(range(len(self._finite_index)), self._finite_index)
+                    x: y for x, y in zip(range(len(finite_index)), finite_index)
                }
-                outliers = list(set(range(X.shape[0])) - set(self._finite_index))
+                outliers = list(set(range(X.shape[0])) - set(finite_index))


I now recover the finite index from the condensed tree, so finite_index does not have to be stored explicitly anymore. This reverts changes I made when I introduced the branch detection code.
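The bookkeeping in this hunk can be sketched in plain Python. This is a minimal stand-in that assumes rows are lists of floats; `finite_row_indices` is a hypothetical substitute for hdbscan's get_finite_row_indices, whose implementation is not shown in this conversation, and the data is made up.

```python
import math


def finite_row_indices(rows):
    """Indices of rows without NaN or infinite values (hypothetical stand-in)."""
    return [i for i, row in enumerate(rows) if all(math.isfinite(v) for v in row)]


X = [
    [0.0, 1.0],
    [float("nan"), 2.0],  # dropped: contains NaN
    [3.0, 4.0],
    [5.0, float("inf")],  # dropped: contains inf
]

finite_index = finite_row_indices(X)
clean_data = [X[i] for i in finite_index]
# Maps a row position in clean_data back to its position in X.
internal_to_raw = dict(enumerate(finite_index))
# Rows of X that were excluded from clustering.
outliers = sorted(set(range(len(X))) - set(finite_index))
```

Because finite_index is now a local variable, it exists only while the data is being cleaned; anything that needs it later has to derive it again, here from the condensed tree as described above.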

JelmerBot · 2024-12-20T21:15:12Z

hdbscan/tests/test_hdbscan.py

-    assert_array_almost_equal,
-    assert_raises,
+    assert_array_almost_equal


Importing assert_raises fails with an import error on CI, so I replaced it with pytest.raises in all tests.
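For reference, a test written with pytest.raises looks roughly like the sketch below. The validator `check_min_cluster_size` and its error message are hypothetical and exist only to give the test something to call; they are not taken from the hdbscan test suite.

```python
import pytest


def check_min_cluster_size(value):
    """Hypothetical validator used only for this example."""
    if value < 2:
        raise ValueError("min_cluster_size must be at least 2")
    return value


def test_rejects_small_min_cluster_size():
    # The with-block passes only if the expected exception is raised;
    # `match` additionally checks the message against a regular expression.
    with pytest.raises(ValueError, match="at least 2"):
        check_min_cluster_size(1)
```

The context-manager form keeps the call under test readable and removes the dependency on the testing helper that failed to import.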

lmcinnes · 2024-12-30T21:57:58Z

This looks great. Let me know when you are ready to have it merged.

JelmerBot · 2025-01-03T11:25:56Z

I think this is ready now. It contains breaking name changes for the BranchDetector, but that makes the naming consistent across repositories and these names make more sense to me.

lmcinnes · 2025-01-07T18:42:37Z

Thanks for all your work on this!

This was referenced Dec 20, 2024

fix plotting fast_hdbscan condensed trees #666

Merged

add branch detection functionality TutteInstitute/fast_hdbscan#32

Merged

JelmerBot marked this pull request as draft December 27, 2024 08:05

JelmerBot added 6 commits December 28, 2024 09:09

replace assert_raise with pytest.raises

04d6a4d

expose neighbors and core distances in boruvka

eefa824

add persistence threshold to hdbscan; improves simplify_hierarchy

8aa5d4c

move branch detection data

2d5c29c

consolidate branch & cluster extraction code

71389c5

add cluster override parameters

3e5e724

JelmerBot force-pushed the dev/flasc-updates branch from 4dfbe7b to 95a70f3 Compare December 28, 2024 08:54

JelmerBot added 2 commits December 28, 2024 13:50

generalize approximation graph to other lens dimensions.

5c4ffbe

update documentation

3ebc8c5

JelmerBot force-pushed the dev/flasc-updates branch from 95a70f3 to 3ebc8c5 Compare December 28, 2024 12:50

JelmerBot added 2 commits January 2, 2025 15:23

consistent parameter naming across repositories

310adf4

fix incorrect cluster override

466960a

JelmerBot marked this pull request as ready for review January 3, 2025 11:24

lmcinnes approved these changes Jan 7, 2025

lmcinnes merged commit 442c3a8 into scikit-learn-contrib:master Jan 7, 2025
1 check passed
